Layout Based Information Retrieval from Document Images
Author
Abstract
This research develops a layout-based retrieval system for document image databases. The work has three phases. First, an intelligent layout analysis algorithm extracts the physical layouts of document images, described by their edges and bounding rectangles. Second, every physically identified layout is converted into an intermediate tree representation for indexing and storage in a layout database; when a query image is supplied, the system retrieves images with similar layouts from that database. Third, a logical layout analysis scheme identifies the meaning of the layouts in the document images. In the intelligent layout analysis system, a White Space Analysis technique is proposed that collects all white spaces in the image in a single scan with a minimum number of pixel visits; the white spaces are then merged, without heuristic assumptions or thresholds, to segment the layouts. Two statistical properties are also proposed in this thesis to separate text blocks from images within the identified layouts. In the layout-based retrieval system, a tree representation of the physical layouts in a document image, together with an indexing scheme, is proposed to enable retrieval of document images by layout similarity. This allows a user to find pages whose layout is similar to that of a supplied query image, based on the tree structure. In the logical layout analysis, this thesis attempts to label the layouts of document images logically as title blocks, subtitles, text, images, content lines, and so on, to produce more meaningful information from the document images.
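As a rough illustration of retrieval by layout tree similarity, the following Python sketch stores each page as a tree of labelled rectangles and ranks stored pages against a query tree. It is not the thesis's method: the `LayoutNode` class, the block labels, the `similarity` score, and the `retrieve` helper are all hypothetical names and choices made only for this example, and the abstract does not specify the actual tree format or matching rule.

```python
from dataclasses import dataclass, field


@dataclass
class LayoutNode:
    """One rectangular block of a page (hypothetical structure)."""
    label: str                       # e.g. "page", "text", "image" (assumed labels)
    box: tuple                       # (x0, y0, x1, y1) in page coordinates
    children: list = field(default_factory=list)


def ordered_children(node):
    """Children sorted top-to-bottom, then left-to-right, by their top-left corner."""
    return sorted(node.children, key=lambda c: (c.box[1], c.box[0]))


def similarity(a, b):
    """Crude structural similarity in [0, 1] (an assumption, not the thesis's measure).

    Nodes with different labels score 0. Otherwise the node itself counts as one
    match, and children are compared pairwise in reading order; unmatched
    children on either side lower the score.
    """
    if a.label != b.label:
        return 0.0
    kids_a, kids_b = ordered_children(a), ordered_children(b)
    matched = sum(similarity(x, y) for x, y in zip(kids_a, kids_b))
    return (1.0 + matched) / (1.0 + max(len(kids_a), len(kids_b)))


def retrieve(query, database, top_k=3):
    """Return the names of the top_k stored layouts most similar to the query."""
    ranked = sorted(database.items(),
                    key=lambda item: similarity(query, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:top_k]]
```

A query tree would be built from the physical layout analysis of the query image and passed to `retrieve` together with a dictionary mapping page names to stored trees. A practical system would likely also compare block geometry, which this sketch ignores.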
Similar resources
Document Image Retrieval Based on Keyword Spotting Using Relevance Feedback
Keyword spotting is a well-known method in document image retrieval. In this method, search in document images is based on a query word image. In this paper, an approach for document image retrieval based on keyword spotting is proposed. In the proposed method, a framework using relevance feedback is presented. Relevance feedback, an interactive and efficient method, is used in this paper to imp...
Document Analysis And Classification Based On Passing Window
In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...
Classification of document page images based on visual similarity of layout structures
Searching for documents by their type or genre is a natural way to enhance the effectiveness of document retrieval. The layout of a document contains a significant amount of information that can be used to classify a document's type in the absence of domain specific models. A document type or genre can be defined by the user based primarily on layout structure. Our classification approach is based o...
Retrieval by Layout Similarity of Documents Represented with MXY Trees
Document image retrieval can be carried out either by processing the converted text (obtained with OCR) or by measuring the layout similarity of images. We describe a system for document image retrieval based on layout similarity. The layout is described by means of a tree-based representation: the Modified X-Y tree. Each page in the database is represented by a feature vector containing both globa...
A Clustering-Based Algorithm for Automatic Document Separation
For text, audio, video, and still images, a number of projects have addressed the problem of estimating inter-object similarity and the related problem of finding transition, or ‘segmentation’ points in a stream of objects of the same media type. There has been relatively little work in this area for document images, which are typically text-intensive and contain a mixture of layout, text-based...